ABSTRACT

Analysis of large data collections using popular machine learning and statistical algorithms has been a topic of increasing research interest. A typical analysis workload consists of applying an algorithm to build a model on a data collection and subsequently refining it based on the results.

In this paper we introduce model materialization and incremental model reuse as first class citizens in the execution of analysis workloads. Instead of discarding built models, we materialize them in a way that allows them to be reused in subsequent computations. At the same time we consider manipulating an existing model (adding or deleting data from it) in order to build a new one. We discuss our approach in the context of popular machine learning models. We specify the details of how to incrementally maintain models and outline the optimizations required to make the best use of models and their incremental adjustments when building new ones. We detail our techniques for linear regression, Naive Bayes and logistic regression and present the algorithms and optimizations needed to handle these models in our framework.

We present the results of a detailed performance evaluation, using real and synthetic data sets. Our experiments analyze the various trade-offs inherent in our approach and demonstrate vast performance benefits.


1. INTRODUCTION 

Analytics on large collections of data has been a topic of vast interest in recent years. Although analysis of data was always central in the data management community, the prevalence of various machine learning and statistical systems and packages has contributed to this interest. As a result, several recent lines of research across communities aim to engineer popular machine learning techniques, both at the algorithmic and at the systems level, to scale to large data collections [2, 13].

Data analytics tasks, however, are rarely run in isolation. Typically an analysis workload consists of applying an algorithm (e.g., a machine learning algorithm or statistical operation) on a large data set, building a model, and subsequently refining the operation based on the results of previous steps. For example, consider building a model (e.g., a regression operation) on a data set produced for the first two weeks of a month (e.g., sales data as it relates to various traffic parameters and promotion activities on a web site). Based on the results of the operation (e.g., regression parameters, error, etc.) one decides to run an additional regression operation for the data set representing the entire month. Alternatively, during a data exploration task, one creates a data model for a year's worth of data collected for a service, only to decide to drill down and build a model for the second month of the year, which seems to present an anomaly for the given model fit.

It is evident that analysis tasks can be part of an analysis workload and rarely run in isolation. Moreover, exploratory tasks may involve extending or refining previously completed tasks. This behavior reveals certain dependencies among the steps of an analysis workload. Such dependencies expose opportunities for work sharing across tasks. For example, one may be able to reuse the model for the first two weeks of the month instead of building the model for the entire month from scratch. Such reuse could be achieved by incrementally updating the current model with additional data. Alternatively, if the model for the subsequent two weeks of the month is available, the desired model for the month could be built by combining the two models as opposed to building it from scratch. Such an option is advantageous as the models are already built and one simply derives a new one without the need to access possibly large collections of data. In a similar fashion we may be able to reuse the model built for a month to derive the model for the first two weeks of the month by removing the last two weeks' worth of data from the model, instead of building the desired model from scratch.

These examples reveal two basic observations that we explore further in this paper. First, analysis workloads consisting of multiple modelling tasks are amenable to work sharing across tasks. In particular, one may be able to reuse models previously built on a data set in order to derive new models on demand. Second, incremental updates (inserting or deleting data) are operations that may help derive a new model from an existing one. It is natural to expect that some models enable work sharing more easily than others. Some models, for example, may allow us to derive a new model by "extending" (with new data) or "shrinking" (removing data) the current model and still derive the exact same model we would have derived by building it from scratch utilizing base data. Some other models could allow us to do this only approximately. At the same time, from a performance standpoint it may not always be beneficial to utilize an existing model and derive a new one by adding or deleting data from it. We expect that in some cases utilizing an existing model to derive a new one is beneficial (we may be able to build the model much faster), but in other cases building the model from scratch is the faster option.

Currently, systems that enjoy vast attention and are utilized for data analysis tasks (e.g., R) do not take advantage of such dependencies and inherent relationships across operations of a data analytics workload. An analyst has to be aware of work sharing and optimization opportunities and express them explicitly in code, which is not an ideal solution.

In this paper we initiate a study to explore these possibilities. We introduce model materialization and incremental model reuse as first class citizens in the execution of an analytical workload. By model materialization we mean that a model can be stored after it is built in order to be considered when generating other models. Since a model requires some space to store it, we incur a storage cost, but we aim to offset such costs with increased performance in executing subsequent operations. By incremental model reuse we mean that when deciding how to build a model required by an analyst, we consider previously built models as candidates to generate the model. Thus, we decide whether we should reuse existing models and/or adjust them incrementally or build the model from scratch. The decision is typically based on performance and we aim to make the choice that results in building the model fastest. Towards this goal we adopt a cost model that aids in this decision; we develop optimization frameworks that decide which models to use and the suitable action to take with the objective of producing the resulting model with the smallest cost.

More specifically, in this paper we make the following contributions:

• We introduce model materialization and incremental model reuse as frameworks to be considered during the execution of an analysis workload.

• Using linear regression and Naive Bayes as examples, we demonstrate how these common models can be cast in our framework. More specifically, we establish that incremental model reuse and model materialization offer large performance benefits, while guaranteeing that models are constructed without loss of accuracy.

• We introduce an algorithm that, given a collection of materialized linear regression/Naive Bayes models, chooses the best models to reuse and also the suitable operations in order to modify them, deriving the desired target model with minimal cost.

• Using logistic regression as an example, we demonstrate that incremental model reuse and model materialization offer large performance benefits while guaranteeing that models are constructed with quantifiable loss in accuracy.

• We introduce an algorithm that, given a collection of logistic regression models, chooses the best models to reuse and the suitable operations in order to modify them, deriving the desired target model with minimal cost.

• We present the results of an extensive performance comparison demonstrating the performance benefits of our approach under varying parameters of interest.

This paper is organized as follows: Section 2 presents introductory material and basic notation. Section 3 demonstrates incremental manipulation of linear regression and Naive Bayes models, followed by Section 4 that treats the case of logistic regression models. Section 5 introduces our optimization framework, followed by an empirical evaluation of the proposal. We then discuss related work and conclude the paper.


2. BACKGROUND 

We provide basic notation and a brief introduction to the techniques we adopt to showcase our overall approach. A more detailed description of the algorithms is available elsewhere.

2.1 Linear Regression 

Linear regression models the relationship between a scalar dependent variable and one or more independent variables. Consider a data set of $n$ records; each record $i$ consists of a $d$-dimensional feature vector of independent variables denoted by $x_i$ and a target dependent variable $y_i \in \mathbb{R}$. Generally, a linear regression takes the following form:

$$y_i = w^T x_i + \epsilon_i$$

where $w$ is the weight vector to be estimated and $\epsilon_i$ is an error term. Usually, the weight parameters are learned by minimizing the sum of squared errors. An $L_2$-regularization term is added to avoid over-fitting of the model. The solution thus obtained has a closed form and is represented as:

$$w = (X^T X + \lambda I)^{-1} (X^T y) \tag{1}$$

$X$ is an $n \times d$ matrix of the input vectors, $y$ is an $n \times 1$ matrix of the target values and $\lambda$ is the regularization parameter.

2.2 Naive Bayes Classifier

Naive Bayes classifiers are simple probabilistic models assuming pair-wise independence of features given the class label. Albeit simple, Naive Bayes models perform very well in classification problems [20]. Given a class variable $Y$ and a set of predictor variables $x_1, \ldots, x_d$, Bayes' theorem states that

$$P(Y = c \mid x_1, \ldots, x_d) = \frac{P(Y = c)\, P(x_1, \ldots, x_d \mid Y = c)}{P(x_1, \ldots, x_d)}$$

Under the naive assumption, and given that $P(x_1, \ldots, x_d)$ is constant for a particular training set, we can conclude that

$$P(Y = c \mid x_1, \ldots, x_d) \propto P(Y = c) \prod_{i=1}^{d} P(x_i \mid Y = c)$$

$P(Y = c)$ can be calculated from training data by maximum likelihood estimation. The class probability $P(Y = c)$ is simply the relative frequency of class $c$ in the training set, $P(Y = c) = N_c / N$, where $N_c$ is the number of training examples which have class $c$ and $N$ is the total number of training examples.

Depending upon the choice of distribution for the conditional density $P(x_1, \ldots, x_d \mid Y = c)$ we have variations of the Naive Bayes classifier. A popular choice in the case of real valued features is the Gaussian distribution:

$$P(x_1, \ldots, x_d \mid Y = c) = \prod_{j=1}^{d} \mathcal{N}(x_j \mid \mu_{jc}, \sigma^2_{jc})$$

where $\mu_{jc}$ is the mean of feature $j$ in samples with class label $c$ and $\sigma^2_{jc}$ is its variance. This is often referred to as Gaussian Naive Bayes. In case of categorical features the multinomial distribution is a preferred choice for the conditional density. The distribution is parametrized by vectors $\theta_c = (\theta_{c1}, \ldots, \theta_{cd})$ for each class, where $d$ is the dimension of the feature vector and $\theta_{ci}$ is the probability $P(x_i \mid c)$ of feature $i$ appearing in a sample belonging to class $c$:

$$P(x_1, \ldots, x_d \mid Y = c) = \frac{\left(\sum_{i} x_i\right)!}{\prod_{i} x_i!} \prod_{i=1}^{d} \theta_{ci}^{x_i}$$

$\theta_{ci}$ can be calculated by a smoothed version of maximum likelihood estimation:

$$\hat{\theta}_{ci} = \frac{N_{ci} + 1}{N_c + d}$$

where $N_{ci} = \sum_{j=1}^{n} x_i^{(j)} [y^{(j)} = c]$, $N_c = \sum_{i=1}^{d} \sum_{j=1}^{n} x_i^{(j)} [y^{(j)} = c]$ and $n$ is the total number of points in the training set. These counters are computed for each class in the training data.


2.3 Logistic Regression

Logistic regression is a linear classifier belonging to the family of Generalized Linear Models. Let $y$ denote a class variable and $x$ represent a feature vector; then Logistic Regression can be formally represented as an optimization problem minimizing a loss function to identify the model parameters. The loss function has the following form:

$$F(w) = \frac{1}{n} \sum_{i=1}^{n} L(w, x^{(i)}, y^{(i)}) + \lambda R(w) \tag{2}$$

A very common choice for function $L$ in logistic regression is the cross entropy loss function:

$$L(w, x^{(i)}, y^{(i)}) = -\left[ y^{(i)} \log h_w(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_w(x^{(i)})\right) \right]$$

and regularization function $R(w) = \|w\|^2$. Here $h_w(x)$ is the logistic function $h_w(x) = \frac{1}{1 + e^{-w^T x}}$.

The Stochastic Gradient Descent (SGD) algorithm is used to optimize the loss function to determine the model parameters. SGD initializes the model parameter $w$ to some $w_0$ and then updates the parameter as

$$w \leftarrow w - \alpha \nabla F_i(w)$$

where $\alpha$ is the learning rate and $\nabla F_i(w)$ is the gradient of the convex loss function using just the $i$-th sample. Stochastic gradient descent requires a single pass on the data to converge.


3. AN INCREMENTAL APPROACH

We now demonstrate how model materialization and incremental model reuse can be supported in each of the types of models we consider. We discuss how one can combine two models on different data sets to produce a new model on the union of the data sets. We also discuss how an existing model can be manipulated (by adding or removing data) to produce a new one. Formally, let $M_1$ be a model on data set $D_1$ and $M_2$ the model on data set $D_2$. We assume that the data sets $D_1$ and $D_2$ have the same properties. We discuss two machine learning models described in the previous section, Linear Regression and Naive Bayes.


3.1 Model Materialization 

A typical machine learning model is characterized by its parameters. In order to support incremental updates to a given model, extra information has to be maintained depending on the model. We show that while materializing a model we can also materialize extra information that is sufficient to support incremental updates. This information varies across different types of models, as discussed further in this section.

3.1.1 Linear Regression 

Let $D$ be a data set of $n$ points and let $M$ represent a machine learning model built on this data set.

Parameters for a linear regression are provided by Equation 1. The equation can be considered as a combination of two terms $A = X^T X$ and $B = X^T y$. Simplifying the terms:

$$A = X^T X = \begin{bmatrix} \sum_{j=1}^{n} x_1^{(j)} x_1^{(j)} & \cdots & \sum_{j=1}^{n} x_1^{(j)} x_d^{(j)} \\ \vdots & \ddots & \vdots \\ \sum_{j=1}^{n} x_d^{(j)} x_1^{(j)} & \cdots & \sum_{j=1}^{n} x_d^{(j)} x_d^{(j)} \end{bmatrix}, \qquad B = X^T y = \begin{bmatrix} \sum_{j=1}^{n} x_1^{(j)} y^{(j)} \\ \vdots \\ \sum_{j=1}^{n} x_d^{(j)} y^{(j)} \end{bmatrix}$$

where $A$ is a $d \times d$ matrix and each term is the sum product of two features of the feature vector over the $n$ training samples. $B = X^T y$ is a $d \times 1$ matrix where each term is the sum product of a feature and the target values. We maintain matrices $A$ and $B$, along with the model parameters, while building a model. Thus we end up maintaining $d^2 + d$ extra values. It is important to note that the amount of extra information we have to maintain is independent of the number of training samples ($n$). Given that we have both components $A$ and $B$, we can compute the model parameters at any point using Equation 1. Later on we will show how we can support incremental updates to a Linear Regression model utilizing this information.
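As a rough illustration of the materialization step above, the following sketch computes $A$ and $B$ once and then derives $w$ from them alone. It is not the authors' implementation: the function names, the toy data and the small Gauss-Jordan solver are assumptions made for the example, and a production system would use a numerical library instead.

```python
def sufficient_stats(X, y):
    """Return A = X^T X (d x d) and B = X^T y (length d) as plain lists."""
    d = len(X[0])
    A = [[sum(row[a] * row[b] for row in X) for b in range(d)] for a in range(d)]
    B = [sum(row[a] * t for row, t in zip(X, y)) for a in range(d)]
    return A, B


def solve_weights(A, B, lam=0.0):
    """Solve (A + lam*I) w = B by Gauss-Jordan elimination with partial pivoting."""
    d = len(B)
    # Augmented matrix [A + lam*I | B]; working on a copy keeps A and B intact.
    M = [[A[i][j] + (lam if i == j else 0.0) for j in range(d)] + [B[i]]
         for i in range(d)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(d):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [M[i][d] / M[i][i] for i in range(d)]


# Toy data generated by y = 1 + 2*x; the first feature is a constant 1 (intercept).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
A, B = sufficient_stats(X, y)   # the only pass over the base data
w = solve_weights(A, B)         # uses the materialized statistics only
```

Only `sufficient_stats` touches the $n$ training points; `solve_weights` works on $d^2 + d$ stored values, which is what makes later reuse cheap.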


3.1.2 Naive Bayes 

As discussed in Section 2.2, Gaussian Naive Bayes is parametrized by the following variables: the class prior probabilities $P(Y = c) = N_c / N$, and $\mu_{jc}$ and $\sigma^2_{jc}$, the parameters explaining the conditional density distribution. These parameters can be computed as shown below:

$$N_c = \sum_{i=1}^{n} [y^{(i)} = c]$$

$$\mu_{jc} = \frac{\sum_{i=1}^{n} x_j^{(i)} [y^{(i)} = c]}{N_c}$$

$$\sigma^2_{jc} = \frac{\sum_{i=1}^{n} \left(x_j^{(i)}\right)^2 [y^{(i)} = c]}{N_c} - \mu_{jc}^2$$

We maintain $N_c$ for each class in the data set, which is the number of samples belonging to each class. In order to calculate $\mu_{jc}$ we maintain the sum of feature $j$ over the samples in class $c$, represented by $S_{jc}$. Similarly, for $\sigma^2_{jc}$ we maintain the sum of squares of the values of feature $j$ in class $c$, represented by $SS_{jc}$. Maintaining the statistics above, we can calculate all the parameters of the model. Assuming we have $C$ classes in total in the data set, we need to maintain $O(d \times C)$ values. This is again independent of the number of training examples ($n$).

The multinomial Naive Bayes model also has the same class prior probabilities $P(Y = c) = N_c / N$. In addition we have to maintain $\theta_{ci}$, for which we also need to store $N_{ci}$ and $N_c$. These parameters are expressed as sums of feature values across the classes. For the case of the multinomial model, we need to maintain $O(d \times C)$ parameters for the model.
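The Gaussian case above can be sketched as follows: one pass accumulates $N_c$, $S_{jc}$ and $SS_{jc}$, and the model parameters are then derived from these counters alone. The function names and toy data are assumptions made for illustration, and the variance is taken to be the maximum likelihood (population) estimate $SS_{jc}/N_c - \mu_{jc}^2$.

```python
from collections import defaultdict


def gaussian_nb_stats(X, y):
    """Accumulate N_c, S_jc and SS_jc in one pass over the data."""
    d = len(X[0])
    N = defaultdict(int)
    S = defaultdict(lambda: [0.0] * d)
    SS = defaultdict(lambda: [0.0] * d)
    for row, c in zip(X, y):
        N[c] += 1
        for j, v in enumerate(row):
            S[c][j] += v
            SS[c][j] += v * v
    return N, S, SS


def gaussian_nb_params(N, S, SS):
    """Derive priors, means and variances from the materialized statistics."""
    total = sum(N.values())
    prior = {c: N[c] / total for c in N}
    mean = {c: [s / N[c] for s in S[c]] for c in N}
    var = {c: [ss / N[c] - m * m for ss, m in zip(SS[c], mean[c])] for c in N}
    return prior, mean, var


X = [[1.0, 2.0], [3.0, 2.0], [10.0, 0.0]]
y = ["a", "a", "b"]
prior, mean, var = gaussian_nb_params(*gaussian_nb_stats(X, y))
```

The stored state is $2dC + C$ numbers regardless of how many training points were scanned.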

3.2 Incremental Model Updates 

In this section we demonstrate how incremental changes (data additions or deletions) can be supported by the two models considered. Formally, let $M$ be a model built on data set $D$ consisting of $n$ points. We demonstrate the incremental changes by considering adding point $(p_1, \ldots, p_d, y)$ to the data set $D$, where $d$ is the dimension of the data. We wish to find the parameters of the new model $M'$ for data set $D' = D \cup \{(p_1, \ldots, p_d, y)\}$ of size $n + 1$.

3.2.1 Linear Regression 

For the linear regression model $M$ we have already computed matrices $A$ and $B$ on data set $D$. We calculate $A'$ and $B'$ on $D'$ by operating on $A$ and $B$ and updating them to reflect the new point. The equations below show how to update matrices $A$ and $B$:

$$A'_{ik} = \sum_{j=1}^{n} x_i^{(j)} x_k^{(j)} + p_i p_k$$

$$B'_{i1} = \sum_{j=1}^{n} x_i^{(j)} y^{(j)} + p_i y$$

Deletions are handled similarly, by subtracting the contribution of the deleted point. Larger collections of points can be added or deleted in a similar fashion. Other statistics computed while building regression models, such as the ANOVA table and AIC, which explain the goodness of fit of the model, can also be incrementally maintained in a similar fashion. Details have been omitted for brevity.
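A minimal sketch of this update, assuming $A$ and $B$ are stored as nested Python lists: the helper name `update_stats` and its `sign` argument are illustrative choices, with `sign=-1` standing in for a deletion.

```python
def update_stats(A, B, p, y, sign=1):
    """Add (sign=+1) or delete (sign=-1) one point (p, y) from A and B in place."""
    d = len(p)
    for a in range(d):
        for b in range(d):
            A[a][b] += sign * p[a] * p[b]
        B[a] += sign * p[a] * y


# Statistics of a data set holding the single point ([1, 2], 3).
A = [[1.0, 2.0], [2.0, 4.0]]
B = [3.0, 6.0]

update_stats(A, B, [1.0, 4.0], 5.0)             # insert a new point
after_insert = ([row[:] for row in A], B[:])    # snapshot for inspection
update_stats(A, B, [1.0, 4.0], 5.0, sign=-1)    # delete the same point again
```

Each update costs $O(d^2)$ and never touches the $n$ points already summarized in $A$ and $B$.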

3.2.2 Naive Bayes Classifier 

For the Naive Bayes model $M$ we have computed $N_c$, $S_{jc}$ and $SS_{jc}$ on $D$. We can update these statistics for $D'$ according to the equations below:

$$N'_c = N_c + [y = c]$$

$$S'_{jc} = S_{jc} + p_j [y = c]$$

$$SS'_{jc} = SS_{jc} + p_j^2 [y = c]$$

Given the updated statistics we can compute the parameters of the updated model $M'$. Similar observations hold for deleting data as well as operating on collections of points.
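The three update rules can be sketched directly on dictionaries keyed by class label. The helper name `update_nb` and the starting statistics are hypothetical; the starting values correspond to two class-"a" points $(1, 2)$ and $(3, 2)$.

```python
def update_nb(N, S, SS, p, c, sign=1):
    """Add (sign=+1) or delete (sign=-1) point p with class c from the statistics."""
    N[c] = N.get(c, 0) + sign
    S.setdefault(c, [0.0] * len(p))
    SS.setdefault(c, [0.0] * len(p))
    for j, v in enumerate(p):
        S[c][j] += sign * v
        SS[c][j] += sign * v * v


N, S, SS = {"a": 2}, {"a": [4.0, 4.0]}, {"a": [10.0, 8.0]}
update_nb(N, S, SS, [5.0, 2.0], "a")            # add a third point of class "a"

# Parameters of the updated model M' follow from the counters alone.
mean = [s / N["a"] for s in S["a"]]
var = [ss / N["a"] - m * m for ss, m in zip(SS["a"], mean)]
```

Only the counters of the class of the new point change, which mirrors the indicator $[y = c]$ in the equations.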

3.3 Combining Models 

Let $D$ be the underlying data set of $n$ points. Assume that points in $D$ are associated with a unique identifier, namely a point $p \in D$ is represented as $p = (id, y, x)$, where $id$ is the identifier, $y$ the dependent (class) variable and $x$ the feature vector as before. To simplify notation for the remainder of the paper, we assume, without loss of generality, that the unique identifier imposes a natural ordering in $D$. For example, $id$ could be a time-stamp associated with the point (indicating the time it was generated). Casting our entire framework for the case where the points of the underlying data set $D$ do not have a unique ordering is indeed possible. It requires, however, a different methodology and we defer the description of this case to future work. Also, for brevity, we denote as $D_i$ both the model and the data set (subset of $D$) on which we wish to build the model. A sequence of these data point identifiers determines a model descriptor, which is a range of points in $D$. Let $D_1$ and $D_2$ be data sets represented by model descriptors $d(D_1) = [a_1, b_1]$ and $d(D_2) = [a_2, b_2]$. Our aim is to compute the model $D_c = D_1 \cup D_2$.

We discuss the linear regression case. Naive Bayes models are handled similarly so we omit the description for brevity. Let $D_1$ and $D_2$ be two linear regression models. For each model we maintain the associated matrices $A = X^T X$ and $B = X^T y$ along with the model descriptor signifying the data set on which it was calculated. Computing the regression model $D_c = D_1 \cup D_2$ involves considering two cases:

Case 1: The two data sets do not have any points in common, i.e. $D_1 \cap D_2 = \emptyset$; this case can be easily identified by comparing the model descriptors of the two data sets. A specific entry in the matrix $X^T X$ for model $D_1$ looks like $\sum_{j} x_a^{(j)} x_b^{(j)}$, where $a$ and $b$ are any two features. Thus, the corresponding matrix $A$ on data set $D_c$ can be computed as

$$\sum_{j \in D_c} x_a^{(j)} x_b^{(j)} = \sum_{j \in D_1} x_a^{(j)} x_b^{(j)} + \sum_{j \in D_2} x_a^{(j)} x_b^{(j)}$$

which is essentially adding the corresponding elements of matrix $A$ of the two models directly.

Case 2: The two data sets have points in common, i.e. $D_1 \cap D_2 \neq \emptyset$; in this case the points common to both data sets can be determined from the corresponding model descriptors. If we directly operate on the two models, the points which are common will be accounted for twice. Thus, we need to exclude points represented in both models and make sure we account for them once in the final model. We compute matrix $A$ on data set $D_c$ in any of the following ways:

$$\sum_{j \in D_c} x_a^{(j)} x_b^{(j)} = \sum_{j \in D_1} x_a^{(j)} x_b^{(j)} + \sum_{j \in D_2} x_a^{(j)} x_b^{(j)} - \sum_{j \in D_1 \cap D_2} x_a^{(j)} x_b^{(j)}$$

$$\sum_{j \in D_c} x_a^{(j)} x_b^{(j)} = \sum_{j \in D_1} x_a^{(j)} x_b^{(j)} + \sum_{j \in D_2 - D_1} x_a^{(j)} x_b^{(j)}$$

$$\sum_{j \in D_c} x_a^{(j)} x_b^{(j)} = \sum_{j \in D_2} x_a^{(j)} x_b^{(j)} + \sum_{j \in D_1 - D_2} x_a^{(j)} x_b^{(j)}$$

The matrix $X^T y$ for $D_c$ can be computed in a similar fashion. Notice that in this case we need to retrieve a few extra points from $D_1$, $D_2$. This incurs an I/O cost that needs to be accounted for (see Section 5).
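The first identity of Case 2 can be sketched as below, under the assumption that descriptors are inclusive ranges of integer ids and that the overlap is read back from base data. The names `xtx`, `combine` and `in_range` and the toy data are illustrative and not part of the original framework.

```python
def xtx(points):
    """A = X^T X for a list of feature vectors."""
    d = len(points[0])
    return [[sum(p[a] * p[b] for p in points) for b in range(d)] for a in range(d)]


def combine(A1, A2, A_overlap=None):
    """A for D1 union D2: add entrywise, then subtract the overlap statistics if any."""
    d = len(A1)
    out = [[A1[a][b] + A2[a][b] for b in range(d)] for a in range(d)]
    if A_overlap is not None:
        out = [[out[a][b] - A_overlap[a][b] for b in range(d)] for a in range(d)]
    return out


# Base data keyed by id; each point is [1, id] purely for illustration.
data = {i: [1.0, float(i)] for i in range(1, 9)}


def in_range(lo, hi):
    """Fetch the points whose ids fall in the inclusive range [lo, hi]."""
    return [data[i] for i in range(lo, hi + 1)]


A1, A2 = xtx(in_range(1, 5)), xtx(in_range(4, 8))   # descriptors [1, 5] and [4, 8]
lo, hi = max(1, 4), min(5, 8)                        # overlap [4, 5], from descriptors only
A_c = combine(A1, A2, xtx(in_range(lo, hi)))         # only 2 points are re-read
```

Here `A_c` matches the statistics of the full range $[1, 8]$, while only the two overlapping points are fetched from base data; that fetch is the I/O cost the text refers to.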


4. INCREMENTAL LOGISTIC REGRESSION MODELS

Stochastic Gradient Descent (SGD) is a popular optimization framework for estimating parameters of a Logistic Regression model. SGD is a sequential algorithm that updates weight parameters at each iteration until convergence. A typical drawback of SGD is its poor scalability on large data sets. Recognizing the importance of analytical tasks on massive data sets, recent work has established methodologies to scale SGD to realistic data sets. We adopt such methodologies and extend them to fit our framework.

A generic loss function for the Logistic Regression model is given in Equation 2. SGD is applied to identify the model parameters $w$ which minimize the loss function. We describe a variant of the SGD algorithm called the Mixture Weight Method [16]. Let us consider a sample $S = (S_1, \ldots, S_p)$ of $pm$ points formed by $p$ sub-samples $S_1, \ldots, S_p$ of $m$ points each, drawn i.i.d. Algorithm 1 outlines the steps for executing the Mixture Weight Method. Notice that the outer loop of the algorithm can be executed in parallel and as a result the approach can easily utilize multiple processors if required.


Algorithm 1 Mixture Weight Method
for all i ∈ {1, ..., p} do
    w_i ← 0
    for t ← 1 to T do
        ∇F_{S_i}(w_i) ← GRADIENT(F_{S_i}(w_i))
        w_i ← w_i − α ∇F_{S_i}(w_i)
    end for
end for
Aggregate all: w_μ = Σ_{k=1}^{p} μ_k w_k


Here $F_{S_i}$ is the optimization function for sample $S_i$ and $T$ is the number of iterations required to converge. Thus, Algorithm 1 computes the model parameters on subsets of data and then averages the parameters across all the subsets to compute the parameters for the complete set of data. In [16] it is shown that Algorithm 1 has good convergence properties and, under certain assumptions, a relationship is established between the $w_\mu$ estimated and the values computed by executing SGD on the entire data set.

We extend this idea in our framework as well. Let $D$ be an underlying data set of size $n$, where a point $p \in D$ is represented as $p = (id, y, x)$, with $id$ the identifier, $y$ the dependent (class) variable and $x$ the feature vector as before.

A request to create a logistic regression model on data set $D_q$ (the query set) is represented by a range of id values $[a, b]$ over $D$ such that $b - a = |D_q| + 1$. The query data set is segmented into smaller chunks of equal size $l$ with the obvious assumption that $l < |D_q|/2$. This results in $|D_q|/l$ chunks of equal size. These chunks are created in increasing order of ID values. A chunk $S_i$ is given by the following range

$$S_i = [a + (i - 1) \cdot l,\; a + i \cdot l]$$

and $i \in \{1, \ldots, |D_q|/l\}$. Assuming that the logistic regression models for each chunk are available, they are combined in the spirit of Algorithm 1 to produce the model for $D_q$. Assuming that none of the chunks is available, a request to build the model for $D_q$ can utilize the base data to build the logistic regression model. At the same time, the chunks are generated for $D_q$, the logistic regression model is built for each of them, and the result is materialized in order to benefit future model creation requests.

Any request to build a logistic regression model for a data set $D_q$ first tests whether $D_q$ contains any of the chunks for which a model has already been materialized. If it does, we can readily utilize its parameters and save computation time. Any parts of $D_q$ that are not currently "covered" by existing chunks have to be computed from the base data set. Thus, we retrieve the parts of $D_q$ for which we don't have the model, generate chunks of size $l$ and compute the model parameters for them. Finally we average all parameters from all chunks to compute the model. Algorithm 2 presents our overall approach.


Algorithm 2 Incremental Logistic Regression
 1: procedure IncrementalLogisticRegression(D_q)
 2:     S ← ranges in D_q for which a model already exists
 3:     P ← {}
 4:     for all ranges r ∈ S do
 5:         D_q ← D_q − r
 6:         P_r ← Logistic Regression parameters for r
 7:         P ← P ∪ P_r
 8:     end for
 9:     Sort D_q in increasing order of ID values
10:     Create chunks of size l from D_q
11:     Compute Logistic Regression parameters on each chunk and add to P
12:     Average all parameters in P
13: end procedure


Theorem 1 establishes a relationship between the outcome of Algorithm 2 on $D_q$ and that computed by applying SGD directly on $D_q$.

Theorem 1. Let $w_\mu$ denote the mixture weight vector obtained by applying Algorithm 2 on a model query $D_q$ and $w_{SGD}$ be the weight vector computed by applying SGD on $D_q$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following inequality holds:

,, ,, ^ R-i/2, 1 1 , 2y/2R , , 

Ik. - ITsokI < —(^ + ^ 

where $R$ is the bound for the norm of feature vectors, $\lambda$ is the regularization constant, $p = |D_q|/l$ is the number of chunks of $D_q$ created in step 10 of Algorithm 2, $l$ is the size of each chunk and $1 - \delta$ represents the probability with which this inequality holds. The proof of Theorem 1 follows the methodology presented in [16] and is available in the full version of the paper.

Note that, in contrast to the discussion of Section 3, for logistic regression models this framework supports adding points to an existing model but not deleting them. Thus we can construct new models only by adding points to existing models (combining existing chunks). This is inherent to the nature of the approximation of the logistic regression. As a result, the space of all possible options to consider when creating a new model includes additions of points to an existing model, not deletions.
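The chunk-and-average scheme of this section can be sketched as follows. This is a simplified illustration rather than the authors' implementation: it assumes half-open id ranges, query boundaries aligned to multiples of the chunk size, uniform mixture weights ($\mu_k = 1/p$), and hypothetical hyperparameters (`lam`, `alpha`, `epochs`); the `cache` dictionary stands in for the store of materialized chunk models.

```python
import math


def sgd_logistic(points, lam=0.01, alpha=0.1, epochs=20):
    """Plain SGD on the L2-regularized cross-entropy loss; points are (y, x) pairs."""
    w = [0.0] * len(points[0][1])
    for _ in range(epochs):
        for y, x in points:
            h = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            # Per-sample gradient: (h - y) * x + 2 * lam * w.
            w = [wi - alpha * ((h - y) * xi + 2 * lam * wi) for wi, xi in zip(w, x)]
    return w


def incremental_logreg(data, a, b, l, cache):
    """Average per-chunk weights over ids a..b-1, reusing chunks found in `cache`."""
    assert (b - a) % l == 0, "sketch assumes the range splits into whole chunks"
    params = []
    for start in range(a, b, l):
        key = (start, start + l)
        if key not in cache:                       # not materialized: use base data
            chunk = [data[i] for i in range(start, start + l)]
            cache[key] = sgd_logistic(chunk)       # materialize for future requests
        params.append(cache[key])
    d = len(params[0])
    return [sum(p[j] for p in params) / len(params) for j in range(d)]


# Toy data: label 1 when the second feature is +1, label 0 when it is -1.
data = {i: (float(i % 2), [1.0, 1.0 if i % 2 else -1.0]) for i in range(40)}
cache = {}
w1 = incremental_logreg(data, 0, 20, 10, cache)    # trains chunks (0,10) and (10,20)
w2 = incremental_logreg(data, 0, 40, 10, cache)    # reuses both, trains two new chunks
```

The second request trains only the chunks that were not yet materialized, which is where the savings come from; as the section notes, this scheme can grow a model by adding chunks but cannot remove individual points.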


5. OPTIMIZATION CONSIDERATIONS

Given a collection of materialized models over a data set $D$, it is evident that a request to create a new model $D_q$ can readily utilize existing models. We seek to understand the trade-offs involved while building the new model $D_q$. Several options are available, including building $D_q$ by manipulating data from $D$ or utilizing materialized models directly and/or suitably adjusting them using data from $D$.

Consider Figure 1. It depicts data set $D$ and four materialized models ($D_1$, $D_2$, $D_3$, $D_4$). A request to build model $D_q$ is faced with numerous options. Using the materialized models to generate model $D_q$, Equations 3, 4 and 5 show different ways in which this can be achieved:

$$D_q = D_3 + D_4 - [b, c] - [e, f] \tag{3}$$

$$D_q = D_3 + D_4 - (D_1 - D_2) - [e, f] \tag{4}$$

$$D_q = [c, d] + D_4 - [e, f] \tag{5}$$

Equation 3 represents an execution strategy which fetches models $D_3$ and $D_4$, combines them, then removes all points in the ranges $[b, c]$ and $[e, f]$ (this constitutes incrementally updating the combined model by removing these points). This step consists of accessing $D$ and retrieving all points in $[b, c]$ and $[e, f]$. In Equation 4, instead of retrieving $[b, c]$ from $D$, we compute that operation by manipulating (subtracting) models $D_2$ and $D_1$. If the model allows (e.g., linear regression) we can subtract $D_2$ from $D_1$ and compute the model for $[b, c]$ directly. Similarly, Equation 5 represents another execution strategy which involves retrieving $D_4$ along with the data points in $[c, d]$ and $[e, f]$ and manipulating them (incrementally updating, adding and removing points) to complete the model construction. Other choices are also possible, including retrieving all points in $[c, e]$ from $D$ and computing the model directly from base data. In order to be able to quantify the merits of each choice, as is typical in cost based query optimization [10], we need to a) assess all possible choices efficiently and b) quantify the cost of each option in order to determine the least cost way to build the model.

The specifics of the cost model are orthogonal to our approach. The cost depends on the type of model and also on the model descriptor, which may or may not involve disk access. In addition, retrieving data from $D$ typically involves disk access. The only requirement we impose on the cost model adopted is that it be monotonic. This means that, all things being equal, retrieving a certain number of data points from disk should be at least as costly as retrieving fewer points. For the remainder of the paper we assume a cost model $C$ that is monotonic. To facilitate notation, the cost of using a materialized model $D_i$ is denoted as $C(D_i)$. The cost of retrieving $n$ data points from disk is denoted as $F(n)$.

Let $S$ be a collection of materialized models on data set $D$. For a model $D_q$, let $d(D_q) = [l_q, u_q]$ be a model descriptor on which a new model has to be computed. $l_q$ and $u_q$ in this case express a range of data points on $D$. We wish to identify the minimum cost collection of materialized models and/or data points from $D$ that would be used to construct the model for $d(D_q)$, $D_q$.


Definition 1. Let $d(D_q) = [l_q, u_q]$ represent a model descriptor for model $D_q$ which we wish to construct and $S$ be the set of available materialized models. Then the set $S_r \subseteq S$ of relevant models for $D_q$ is defined as follows:

1. If for a materialized model $S_i \in S$, $d(S_i) \cap d(D_q) \neq \emptyset$, then $S_i \in S_r$.

2. $\forall S_i \in S$ such that $\exists S_j \in S_r$ with $d(S_i) \cap d(S_j) \neq \emptyset$, then $S_i \in S_r$.


Intuitively, the models in $S_r$ are relevant because they either contain data points in common with the ones of interest to $D_q$ and/or they are models that can be manipulated (by combinations of models or incremental updates of models) to produce models that assist in computing $D_q$. As we can see in Figure 1(a), materialized models $D_3$ and $D_4$ contain data points in common with $D_q$, while $D_1$ and $D_2$ can be manipulated along with $D_3$ to produce models relevant to the computation of $D_q$. While computing $D_q$, only relevant models will be part of $S_r$.
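Once the relevant models are known, candidate plans such as those of Equations 3 to 5 can be compared under a monotonic cost model. The sketch below is purely illustrative: the cost functions `C` and `F`, their constants and the numeric positions chosen for the endpoints $a, \ldots, f$ are all hypothetical, and range sizes are approximated by endpoint differences.

```python
def C(model):
    """Hypothetical flat cost of fetching one materialized model."""
    return 2.0


def F(n):
    """Hypothetical, monotonic cost of reading n points from disk."""
    return 1.0 * n


# Illustrative positions for the range endpoints a..f of the running example.
a, b, c, d, e, f = 0, 10, 20, 40, 50, 60

plans = {
    "eq3: D3 + D4 - [b,c] - [e,f]": C("D3") + C("D4") + F(c - b) + F(f - e),
    "eq4: D3 + D4 - (D1 - D2) - [e,f]": C("D3") + C("D4") + C("D1") + C("D2") + F(f - e),
    "eq5: [c,d] + D4 - [e,f]": C("D4") + F(d - c) + F(f - e),
    "scratch: [c,e]": F(e - c),
}
best = min(plans, key=plans.get)   # least-cost plan under these assumed costs
```

With these made-up numbers the plan of Equation 4 wins because it replaces a disk read with two cheap model fetches; different cost constants would change the outcome, which is why the choice is left to a cost-based optimizer.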


Algorithm 3 PreprocessDescriptors(S)
enhancedDescriptors ← mapping of descriptors to the corresponding materialized models
descriptor ← a model descriptor represented by [l, u]
arrayDescriptors ← array of descriptors
Sort S in increasing order of l values
descriptor[0] ← l value of first descriptor in S
descriptor[1] ← u value of first descriptor in S
arrayDescriptors ← append first descriptor in S
for each remaining descriptor r ∈ S do
    if r overlaps descriptor then
        descriptor[1] ← max(descriptor[1], u value of r)
        arrayDescriptors ← append r
    else
        enhancedDescriptors.put(descriptor, arrayDescriptors)
        arrayDescriptors ← {}
        arrayDescriptors ← append r
        descriptor[0] ← l value of r
        descriptor[1] ← u value of r
    end if
end for
enhancedDescriptors.put(descriptor, arrayDescriptors)
return enhancedDescriptors


The set of relevant models Sr is important since it
accurately reflects the set of models to be considered during
the computation of Dq. Instead of searching all materialized
models for the relevant ones every time a new request for a
model Dq arises, we pre-process the collection of all
materialized models S to facilitate the derivation of Sr for
a given Dq. Algorithm 3 presents the overall approach. The
basic idea is to pre-process S and create enhanced
descriptors that are the union of multiple model
descriptors. Such enhanced descriptors facilitate quick
search for relevant models.

Figure 1: Graph modeling to find optimal execution strategy
for query interval Iq. Panels: (a) Materialized model state;
(b) Query graph using cost model; (c) Query path conversion,
showing the query path (c, a, b, d, f, e) and the rewritten
plan Dq = -D1 + D2 + D3 + D4 - [e, f].

Running Algorithm 3 on the example of Figure 1a will produce
two enhanced descriptors, namely [a, d], formed by combining
the descriptors for models {D1, D2, D3}, and [d, f], which
constitutes the descriptor of model {D4}.

Maintaining enhancedDescriptors makes it easier to compute
the set Sr. When the descriptor of a model Dq is provided,
we compare it against the enhancedDescriptors. If it
intersects any of the descriptors in enhancedDescriptors,
all the materialized models mapped to that descriptor become
part of Sr.
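The pre-processing and lookup steps can be sketched in Python as follows. This is an illustrative sketch only: the function names, the dictionary-based input and the numeric descriptors are our assumptions, and we assume that descriptors which only touch at an endpoint are kept in separate enhanced descriptors.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]
Enhanced = List[Tuple[Interval, List[str]]]


def preprocess_descriptors(models: Dict[str, Interval]) -> Enhanced:
    """Merge overlapping descriptors into enhanced descriptors."""
    enhanced: Enhanced = []
    ordered = sorted(models.items(), key=lambda kv: kv[1])  # by l, then u
    if not ordered:
        return enhanced
    first_name, (lo, hi) = ordered[0]
    group = [first_name]
    for name, (l, u) in ordered[1:]:
        if l < hi:  # r overlaps the current enhanced descriptor
            hi = max(hi, u)
            group.append(name)
        else:       # close the current group and start a new one
            enhanced.append(((lo, hi), group))
            lo, hi, group = l, u, [name]
    enhanced.append(((lo, hi), group))  # do not forget the last group
    return enhanced


def lookup_relevant(query: Interval, enhanced: Enhanced) -> List[str]:
    """Collect the models of every enhanced descriptor the query intersects."""
    lq, uq = query
    names: List[str] = []
    for (lo, hi), group in enhanced:
        if lo < uq and lq < hi:
            names.extend(group)
    return names


# Hypothetical descriptors; they are not the values behind Figure 1.
models = {"D1": (0, 10), "D2": (5, 20), "D3": (15, 40), "D4": (40, 60)}
enhanced = preprocess_descriptors(models)
```

With these hypothetical values the three overlapping models collapse into one enhanced descriptor (0, 40) and the fourth model forms its own descriptor (40, 60), so a lookup only needs to compare the query against two intervals instead of four.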

Algorithm 3 will produce the set Sr of all models that
should be considered in deriving model Dq. Using the
descriptors in Sr we create a complete undirected graph
G(V, E) where each node v ∈ V corresponds to the l or u
values of the model descriptors in Sr. In our running
example the set Sr contains models D1 to D4, so we add the l
and u values of the descriptors of these materialized
models; as we can see in Figure 1b, the graph contains a to
f as nodes. An edge e ∈ E carries the cost of building a
model for the data set specified by the two nodes adjacent
to e. If a materialized model M exists for the data
descriptor specified by the nodes adjacent to the edge e,
then the cost of the edge is the cost of using model M,
C(M). If a model does not exist for that data set, the cost
of the edge is determined by the number of points in the
range. In our example the solid edges in the graph represent
the materialized models D1 to D4. For all the other edges
the cost is given by F(n), where n is the number of points
in the interval represented by the edge. Given Dq with
d(Dq) = [lq, uq], the values lq and uq represent the source
and destination respectively. These are shown as grey nodes
in Figure 1b.

Every path from the source node to the destination
represents an execution strategy to construct model Dq.
Figure 1c illustrates how to convert a path on the graph to
a set of operations that compute the model. Consider a path
on the graph represented by the following sequence of nodes:
(c, a, b, d, f, e). We fetch four materialized models D1,
D2, D3 and D4 for the edges (c, a), (a, b), (b, d) and
(d, f) respectively. The edge (f, e) does not correspond to
any materialized model, thus the cost of that edge is
equivalent to fetching the corresponding data points from
disk. Whether an existing model is manipulated by adding or
by removing data points is decided by the nodes of the edge.
If we traverse the edge (i, j) from i to j and i > j then we
remove points from the model, otherwise we add data points.
In our example, for edge (c, a) we have c > a (as indicated
in Figure 1a) and that constitutes removing points.


Algorithm 4 Identify Optimal Execution Path

  procedure GenerateGraph(Sr, Dq, C(M), F(n))
    initialize Graph G(V, E)
    G ← add vertices corresponding to l and u values of d(Dq)
    for each descriptor r ∈ Sr do
      G ← add vertices corresponding to l and u values of r
      G ← add an edge between the two new vertices with
          weight C(Dr)
    end for
    for each vertex v ∈ G do
      for each vertex u ∈ G do
        if (no edge between u and v) and u ≠ v then
          G ← add an edge between u and v with weight
              F(|u - v|)
        end if
      end for
    end for
    return G(V, E)
  end procedure

  procedure OptimalPath(Sr, Dq, C(M), F(n))
    Identify Sr using algorithm PreprocessDescriptors
    G ← GenerateGraph(Sr, Dq, C(M), F(n))
    Apply Dijkstra's Algorithm using the l and u values of
        d(Dq) as source/destination
    return the shortest path
  end procedure


The total cost of a query path is given by

$$C(D_q) = \sum_{i=1}^{k} cost(e_i) + (k - 1) \cdot c_{merge}$$

where $k$ is the number of edges on the path, $cost(e_i)$ is
the cost of each edge and $c_{merge}$ is the cost of merging
two materialized models. The cost $c_{merge}$ depends on the
type of model under consideration. For example, for linear
regression the cost is outlined in Section 3.3. It involves
(after retrieving the model parameters) a simple
manipulation of the corresponding model representations. It
is expected that the cost of merging two materialized models
is much less than the cost of fetching models or the cost of
fetching data points from the disk
($c_{merge} \ll cost(e_i)$). Depending on how the model
descriptors and model parameters are stored, retrieving them
may not require any disk access. For example, in the case of
a linear regression model, the model descriptors would be
just a range of values and the model parameters would be as
outlined in Section 3.1.1.
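To make the path conversion and the cost formula concrete, here is a small Python sketch. It assumes numeric node values, a dictionary of materialized models keyed by their (l, u) descriptor, and a caller-supplied fetch cost function; `path_to_plan`, the per-model costs and the merge cost are hypothetical values chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple


def path_to_plan(
    path: List[int],
    models: Dict[Tuple[int, int], Tuple[str, float]],
    fetch_cost: Callable[[int], float],
    c_merge: float,
) -> Tuple[List[str], float]:
    """Turn a node path into signed operations and compute its total cost."""
    plan: List[str] = []
    total = 0.0
    for i, j in zip(path, path[1:]):
        sign = "+" if i < j else "-"      # traversing i -> j with i > j removes points
        lo, hi = min(i, j), max(i, j)
        if (lo, hi) in models:            # edge backed by a materialized model
            name, cost = models[(lo, hi)]
        else:                             # edge served by fetching raw points
            name, cost = f"[{lo},{hi}]", fetch_cost(hi - lo)
        plan.append(sign + name)
        total += cost
    k = len(plan)
    total += (k - 1) * c_merge            # k pieces need k - 1 merges
    return plan, total


# Hypothetical values standing in for a=0, c=10, b=20, d=40, e=55, f=60.
models = {
    (0, 10): ("D1", 2.0),
    (0, 20): ("D2", 2.0),
    (20, 40): ("D3", 2.0),
    (40, 60): ("D4", 2.0),
}
plan, total = path_to_plan(
    [10, 0, 20, 40, 60, 55], models, fetch_cost=lambda n: float(n), c_merge=0.5
)
```

For this path the sketch yields the plan -D1 +D2 +D3 +D4 -[55,60], mirroring the rewritten plan of Figure 1c, with four model edges, one raw-data edge and four merges contributing to the total.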

It is evident that by construction the problem of
identifying the minimum cost to construct the model Dq is
equivalent to identifying the shortest path from a single
source in a weighted graph. Dijkstra's algorithm can be used
to identify the optimal solution in O(|E| log |V|), where
|E| is the number of edges and |V| is the number of vertices
in the graph.
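A compact Python sketch of the graph construction and shortest path search is given below. It is an illustration under stated assumptions rather than the framework's code: nodes are numeric descriptor endpoints, `models` maps an (l, u) descriptor to the cost of using that model, `fetch_cost` plays the role of the raw-data cost function, and merge costs are left out for brevity.

```python
import heapq
from itertools import combinations
from typing import Callable, Dict, List, Tuple

Graph = Dict[int, Dict[int, float]]


def build_graph(
    query: Tuple[int, int],
    models: Dict[Tuple[int, int], float],
    fetch_cost: Callable[[int], float],
) -> Graph:
    """Complete undirected graph over query and model descriptor endpoints."""
    nodes = set(query)
    for l, u in models:
        nodes.update((l, u))
    graph: Graph = {v: {} for v in nodes}
    for u, v in combinations(sorted(nodes), 2):
        weight = models.get((u, v))
        if weight is None:                 # no materialized model for this range
            weight = fetch_cost(v - u)
        graph[u][v] = graph[v][u] = weight
    return graph


def shortest_path(graph: Graph, source: int, target: int) -> Tuple[List[int], float]:
    """Dijkstra's algorithm with a binary heap."""
    dist = {source: 0.0}
    prev: Dict[int, int] = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], dist[target]


# Hypothetical costs: each model costs 2.0, raw data costs 1.0 per point.
models = {(0, 10): 2.0, (0, 20): 2.0, (20, 40): 2.0, (40, 60): 2.0}
graph = build_graph((10, 55), models, fetch_cost=lambda n: float(n))
path, cost = shortest_path(graph, 10, 55)
```

With these hypothetical costs the cheapest strategy walks through all four models and then removes the small tail range, which is the same shape as the query path discussed for Figure 1. Because the graph is complete, |E| grows quadratically with the number of endpoints in Sr.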

We presented the entire solution for the case of models
that support both addition and removal of points to derive
new models, as is the case for linear regression and Naive
Bayes. For logistic regression, removal of points is not
supported in the model we utilize to approximate the
regression. In this case we have to slightly modify the
algorithm to enable optimization of logistic regression
models as well. The changes are as follows:

• During identification of the set Sr we include only models
whose descriptors are fully contained in the descriptor
d(Dq).

• The graph G constructed contains only directed edges from
nodes i to j such that i < j.


These two changes enable Algorithm 4 to operate on logistic
regression models and yield the least cost options to
construct such models as well.


6. EXPERIMENTS

In this section we present a detailed performance
comparison of our approach against the alternative. We
utilize materialized models to save processing costs while
building new models for an incoming (model construction)
query Dq, as described in the preceding sections.

The natural alternative is not to materialize models, but
instead build the new model directly from the raw data. We
compare our approach against this baseline. Our aim in
these experiments is three-fold: (a) highlight the factors
that affect performance for our materialization framework
and the associated trade-offs, (b) detail the impact of our
optimization framework in terms of its overheads and
benefits and (c) analyze the accuracy of the logistic
regression materialization framework. Note that for the
linear regression and naive Bayes models, the models we
construct are exactly the same as those constructed by the
baseline, so there are no accuracy trade-offs in these
cases.

Data. We test our framework utilizing synthetically
generated data. Two different data sets are generated, one
for the regression problem and one for the classification
problem. The choice of synthetic data allows us to change
various parameters during experimentation. In addition, the
experiments focus on performance while scaling the size of
the model, and performance does not depend on the quality of
the data but is governed by its size and type. The data is
generated using publicly available synthesizers [18]. Random
noise and interdependency among features are added while
synthesizing data to simulate real world scenarios. In this
section we present results using data sets of up to 5
million points with 10 features in each point. We tested all
algorithms with synthetically generated data sets of larger
sizes but the trends observed in our experiments were nearly
the same. In addition we utilized popular real data sets
from the UCI Machine Learning repository in our experiments
and in all cases the results are consistent with those
presented herein for synthetic data sets.

Experimental Setup. All our experiments were prototyped on
top of MySQL (version 5.5.44) in a single node RDBMS
setting. The model materialization framework code has been
written in Python. The experiments were carried out on a PC
running Linux Kernel Version 3.13.0-43-generic. The machine
has a 3.40GHz Intel Core i7-3770 CPU with 16 GB of main
memory.

Our framework is naturally parametrized by the size of the
materialized models (Z) and the size of the incoming model
construction query (Dq). Another important parameter which
is implicit in our discussion is the amount of data covered
by the materialized models. Materialized models can be
spread uniformly across the data set or may be concentrated
on a few data points. To quantify the coverage we compute
the number of unique data points covered by the materialized
models and express it as a percentage of the total size of
the data set. Formally, let $D_1, \ldots, D_n$ be the
collection of models materialized at a given stage in the
framework. For the data set $D$, coverage is defined as
follows:

$$\mathrm{Coverage}(\%) = \frac{|D_1 \cup D_2 \cup \ldots \cup D_n|}{|D|} \times 100$$
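As an illustration of the coverage definition, the following Python sketch computes the size of the union of model ranges with a single sweep over sorted descriptors. It assumes that each model is described by a half-open range of integer point indexes [l, u); this representation and the function name are our assumptions for the example.

```python
from typing import List, Tuple


def coverage_percent(models: List[Tuple[int, int]], dataset_size: int) -> float:
    """Percentage of unique data points covered by the union of model ranges."""
    covered = 0
    current_end = None
    for l, u in sorted(models):
        if current_end is None or l > current_end:
            covered += u - l               # disjoint from everything seen so far
            current_end = u
        elif u > current_end:
            covered += u - current_end     # count only the part not yet covered
            current_end = u
    return 100.0 * covered / dataset_size


# Two overlapping models and one separate model over a 100-point data set.
example = coverage_percent([(0, 10), (5, 20), (40, 60)], dataset_size=100)
```

Points shared by several models are counted once, so overlapping materialized models do not inflate the coverage figure.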

These parameters are varied across our experiments to
understand their impact on performance gain. Let Dq be a
model construction query. Our optimization framework
identifies the optimal way to build model Dq. Let the
overall time taken by our framework to build the model be
$T$ (including the optimization and model construction
time). Let the time taken by the baseline be $T_0$. Then the
performance gain is calculated as the ratio of the baseline
time to the time taken by our framework, so that values
above 1 indicate that our framework is faster:

$$\text{Performance Gain}\,(PG) = \frac{T_0}{T}$$

In all experiments we report expected values. A query set S
containing one thousand queries is generated for each
experiment. The query size is chosen from a uniform or
normal distribution as explained in the individual sections.
Each query represents a range of data points which can be
positioned anywhere across the underlying data. Similarly
the materialized model size (Z) is chosen from a uniform
distribution, a normal distribution or set to a fixed size.
We create a set of materialized models M on the data set
with a given coverage as required by the experimental
setting. The models are materialized before executing the
query set S.
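One way such a query workload could be generated is sketched below in Python. This is an assumed procedure, not the generator used in the experiments: sizes are drawn from a normal distribution, sizes that do not fit the data set are rejected, and each query range is placed uniformly at random; `generate_queries` and the seed are hypothetical.

```python
import random
from typing import List, Tuple


def generate_queries(
    n_queries: int, dataset_size: int, mean: float, std: float, seed: int = 0
) -> List[Tuple[int, int]]:
    """Draw query ranges [l, u) with normally distributed sizes."""
    rng = random.Random(seed)
    queries: List[Tuple[int, int]] = []
    while len(queries) < n_queries:
        size = int(rng.gauss(mean, std))
        if size <= 0 or size > dataset_size:
            continue                        # reject sizes that do not fit
        start = rng.randint(0, dataset_size - size)  # position anywhere in the data
        queries.append((start, start + size))
    return queries


# One thousand queries over 5M points with sizes drawn from N(50K, 12.5K).
queries = generate_queries(1000, 5_000_000, mean=50_000, std=12_500)
```

The same routine could be reused with a uniform size distribution by swapping the `rng.gauss` call for `rng.randint` over the desired size interval.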

6.1 Analyzing Performance 

We assess the overall performance gain attained by our
approach as compared to the baseline. Experiments were run
for all three machine learning models: Linear Regression,
Naive Bayes and Logistic Regression. The sizes of the
models in M and of the queries in S are chosen from the same
normal distribution, N(50K, 12.5K). The x-axis depicts the
percentage of data covered by materialized models. We
execute the queries in set S and report the performance
gain. Figures 2a and 2b show that we were able to achieve a
performance gain of 2x as the coverage reaches 90%. The
increase in coverage implies a higher probability of
identifying relevant models for the query. Thus the expected
performance gain improves as the coverage increases. The
performance gain for Logistic Regression is shown in
Figure 2c. The maximum performance gain achieved for
logistic regression is 1.8x, which is slightly lower than
for the other two models. This can be explained by the fact
that for Logistic Regression our framework supports only
incremental updates to materialized models (as discussed at
the end of the previous section). Thus, it eliminates
certain execution strategies which would have been faster in
the presence of decremental updates.

Figure 2: Performance gain against coverage percentage.
Panels: (a) Naive Bayes Model; (b) Linear Regression Model;
(c) Logistic Regression. The x-axis is coverage.

Figure 3: Performance gain against materialized model size.
Panels: (d) Naive Bayes Model; (e) Logistic Regression
Model. The x-axis is materialized model size.


Coverage    Model Sizes (MB)
20%         1.5
40%         1.8
60%         2.5
80%         3.5
90%         4.5

Table 1: Disk space occupied by materialized models for
various coverage (%)

The previous experiment demonstrates that utilizing
materialized models can have a profound effect on
performance when constructing new models. However
materializing a model comes at a cost, namely that of
storing the model descriptors as well as the model details
(e.g., regression parameters and meta-data in the case of
linear regression, as defined earlier). Table 1 depicts the
space occupied by the materialized linear regression models
for each value of coverage. The size of the materialized
model is fixed at 5K points. The base data set size is
350MB, containing 5M points with 10 features. As is visible
from the table, the storage overhead imposed by the
materialized models is around 1.2% of the original data.
Similar trends hold for the other models of interest in our
study. It is evident that the minor storage overheads are
heavily compensated by the performance benefits.
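The relative overhead for each row of Table 1 can be checked with a few lines of Python. The sketch simply divides the reported model sizes by the 350MB base data size; since the table values are rounded, the resulting percentages are approximate.

```python
# Coverage (%) -> disk space of materialized models in MB, copied from Table 1.
table_1 = {20: 1.5, 40: 1.8, 60: 2.5, 80: 3.5, 90: 4.5}
base_mb = 350.0  # size of the base data set reported in the text

# Storage overhead as a percentage of the base data set.
overhead = {cov: 100.0 * mb / base_mb for cov, mb in table_1.items()}

for cov, pct in overhead.items():
    print(f"coverage {cov}%: {pct:.2f}% of the base data")
```

Under this calculation the overhead stays below about 1.3% of the base data even at 90% coverage, in line with the figure quoted above.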

6.2 Materialized Model Size and Performance Gain

The size of materialized models is an important parameter
in our framework. With the next set of experiments we wish
to understand the impact of the size of materialized models
on performance. Two test query sets S1 and S2, of size 50K
and 100K points respectively, are used as shown in Figure 3.
On the x-axis we represent different materialized model
sets, each of fixed model size, with coverage fixed to 50%.
The size of the materialized models is varied from 5K points
to 70K points. We present results for Naive Bayes (supports
both incremental and decremental updates) and Logistic
Regression (supports only incremental updates); similar
trends hold for linear regression as well. Panels (d) and
(e) of Figure 3 present results for Naive Bayes and Logistic
Regression respectively. We observe that for a fixed query
size Dq and fixed coverage there is an optimum size of
materialized models which results in maximum performance
gain. We achieve a maximum performance gain for S1 at a
materialized model size of 20K for Naive Bayes. Similarly,
for Logistic Regression we achieve the maximum performance
gain at 10K materialized model size. As the size of the
query increases the optimal materialized model size also
increases. As shown in the graphs the query set S2 has its
maximum at 30K and 20K for Naive Bayes and Logistic
Regression respectively, which is larger than the maximum
for S1. The exact position of the maximum on the graph
depends on the size of the specific query (or query workload
for multiple queries) for a given cost model.

Figure 4: Time taken by large and small queries for various
materialized model sizes. Panels: (a) Small Size Model
Query; (b) Large Size Model Query; (c) Small Size Model
Query on Real Data.

6.3 Materialized Model and Query Size 

We conducted experiments to quantify performance while
scaling to larger input queries and materialized model
sizes. The model chosen for these experiments was Naive
Bayes, although linear regression shows the same trends.
Figure 4 shows the four sizes of materialized models under
consideration, M1 to M4. M1 represents materialized models
with their size chosen from the uniform distribution
U(25K, 50K). Thus M1 is the scenario in which all the
materialized models have a size uniformly distributed
between 25K and 50K. Similarly M2, M3 and M4 follow the
uniform distributions U(75K, 100K), U(150K, 200K) and
U(250K, 500K). Figure 4a shows the time taken to execute
queries of small sizes drawn from U(50K, 100K). As depicted
in the graph, for M1 and M2 the time taken to execute the
model queries decreases linearly as coverage increases.
However for M3 and M4, which correspond to considerably
larger materialized model sizes, the performance improvement
becomes significant only after 70% coverage. As coverage
increases there is a higher probability of finding two
materialized models which can be subtracted in order to
create a smaller model. Figure 4c shows a similar trend for
small queries on a real world data set from the UCI machine
learning repository representing physical activity data of
3M points, consisting of 31 attributes and 13 classes. It is
evident that the main trends are the same as in the case of
the synthetic data set, as is the case in all of our
experiments. Figure 4b is the graph for larger query sizes
drawn from the distribution U(500K, 750K). Since the query
size is much larger, we can observe that in all four cases
materialized models are utilized to generate the model for
the input query. For M1, small models can be combined to
generate the models for larger data sets, while for M4 a
large materialized model which has the maximum overlap with
the incoming model construction query is manipulated to
generate the new model. It is evident that the relationship
of the query size to the materialized model size is
important in our setting. When the query workload has a much
smaller size than the materialized model sizes
(correspondingly when the query workload has a much larger
size than the materialized model sizes) employing our
framework does not result in large performance benefits. It
is evident however that enabling our framework in these
cases does not impose an overhead either.

Figure 5: Distribution of time across various I/O and
computation tasks

6.4 Optimization and I/O Time 

As mentioned earlier, the cost of merging models is
considerably smaller than disk access time. We measure the
time taken by the three major components of our framework,
namely optimizer time, disk access time (including both
fetching materialized models and fetching raw data points)
and model combination time. The optimizer time refers to the
time taken to run Algorithm 4. The time spent in fetching
any information from MySQL is referred to as I/O time. The
time remaining in our computations which cannot be
attributed to the above cases is the time taken to merge the
models. Experiments were run on a test set of a thousand
queries. The size of the model to be generated is chosen
from the normal distribution N(50K, 12.5K).

The expected time for each component is reported in
Figure 5. As can be observed, the majority of the time to
create models is spent fetching data from disk. Model
combination time is fairly constant and is much smaller than
disk time. Optimizer time is insignificant for small
coverage and only becomes visible (but still negligible) on
the graph when coverage is close to 80% and above. As
coverage increases the number of possible execution plans
becomes considerably larger, thus the optimizer takes longer
to build the graph and determine the shortest paths in the
graph. This graph reveals that the overhead of running the
optimization is minimal. Since the potential benefits of
considering materialized models are significant, it is
evident that if one chooses to materialize models, the
performance overhead of the optimizer is negligible. Thus,
running the optimizer, even if the decision is to employ the
baseline, imposes minimal penalty on query performance. In
the graph the baseline is represented by the x-axis value at
zero percent coverage. It can be seen that disk time reduces
from 250 ms to 110 ms, while the optimizer time and model
combination time are roughly 10 ms. Thus, when the coverage
is low, the overhead of the optimizer is so small that even
when no materialized model can be utilized and the model has
to be constructed as in the baseline, the impact of the
optimizer on the overall performance is immaterial, as
evident in Figure 5. At high coverage, the chances of
utilizing materialized models are much higher. In that case,
the small overhead of the optimizer is clearly compensated
by the large savings in model construction time.

Figure 6: Accuracy and Performance statistics for Logistic
Regression with materialized model size of 10K

Figure 7: Accuracy and Performance statistics for Logistic
Regression with materialized model size of 20K

6.5 Accuracy 

In this section we analyze the accuracy of our framework
for the logistic regression models presented earlier. We
quantify the accuracy of the overall approach.

Synthetically generated classification data with 10
features and 2 classes were used to run test experiments.
Similar trends hold when the number of classes increases, so
we omit these experiments for brevity. We ran experiments on
a test set S of a thousand queries. For each of these
queries the model was built using our framework and also by
applying SGD. We compare the accuracy on training data for
both models by computing their difference. Let $A$ refer to
the accuracy of the model built by our framework and $A_0$
refer to the accuracy of the SGD algorithm; the accuracy
difference is then $A_0 - A$. Various statistics are
reported on this difference. Figures 6a and 6e present the
average of the accuracy difference between the model
constructed by our approach and the model constructed by SGD
directly. The x-axis represents queries in increasing order
of size. The graphs show negative average values, which
means that on average the model generated by our framework
outperforms the model developed by SGD on training data.
Also, as the query size increases the expected performance
of our model improves. Figures 6b and 6f present the average
difference in accuracy for the cases where $(A_0 - A) > 0$.
It can be seen that the average positive difference lies
within 0.5%. It is evident that the overall approach is
highly accurate. Across the materialized model sizes we
observe that the larger size has better accuracy than the
smaller size. Finally, Figures 6c and 6g present the maximum
difference across various query sizes. The graphs show that
as the query size increases the maximum difference between
the model computed by our framework and that computed by SGD
decreases. It is visible from the graphs that
$\max(A_0 - A) < 3\%$. The last set of graphs presents the
trade-off between accuracy and the corresponding performance
gains achieved by our framework. As Figures 6d and 6h
suggest, we experience a performance gain of 1.5x while we
compromise accuracy by 3% in the worst case. Similar results
were observed on real world data sets, including the
publicly available PAMAP2 data set. Since they are
consistent with what has been presented, these results are
omitted for brevity.
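The three statistics reported on the accuracy difference can be computed as in the Python sketch below. The function name and the sample accuracies are hypothetical; the sketch only illustrates the average, the average over positive differences and the maximum of A0 - A for a set of queries.

```python
from statistics import mean
from typing import Dict, List


def accuracy_difference_stats(
    sgd_acc: List[float], framework_acc: List[float]
) -> Dict[str, float]:
    """Summarize A0 - A over a set of queries (accuracies in percent)."""
    diffs = [a0 - a for a0, a in zip(sgd_acc, framework_acc)]
    positive = [d for d in diffs if d > 0]   # queries where SGD is more accurate
    return {
        "mean": mean(diffs),
        "mean_positive": mean(positive) if positive else 0.0,
        "max": max(diffs),
    }


# Made-up accuracies (in %) for three queries, for illustration only.
stats = accuracy_difference_stats([90.0, 85.0, 80.0], [91.0, 84.0, 82.0])
```

A negative mean indicates that, on average, the model built from materialized models is at least as accurate on training data as the one trained directly with SGD.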

7. RELATED WORK 

There has been an ever increasing interest in integrating
statistical and machine learning capabilities into data
management systems. Several efforts have been made in
academia and industry to address this demand. Major database
vendors now support analytical capabilities on top of their
database engines: IBM's SystemML, Oracle's ORE and SAP HANA.
However the integration is loose and does not support
notions of model persistence or incremental computations. In
the open source community one can observe similar trends
with the MADLib [13] library support for Postgres. Other
data platforms like Spark and Hadoop also support machine
learning libraries as an external layer on top of their data
processing system, with MLLib and Mahout respectively. Such
approaches either utilize an existing data management
platform and deploy its extensions to provide analytics
capabilities or represent systems that can execute machine
learning and statistical packages. See [11] for a general
overview of systems support for machine learning and
statistical operations. HaLoop and Dryad [14] are examples
of systems that utilize a form of persistence in their
operations to improve the execution of a graph data flow.
Although related in spirit, the approach and goal of these
systems is to improve the performance of specific iterative
graph data flow computations; they do not address the case
of synthesizing a new model by extending and/or combining
past models, which is central in our approach.

Recent work [15] focused on pushing machine learning
primitives inside a relational database engine. Our work is
intended as a middle layer between the data processing
engine and the analytical computing language layer. We
require awareness of previous computations by collecting
them, and we exploit materialized models to build new models
for the data. Our goal is to explore natural work sharing
opportunities that exist in a typical data analysis
workload.

Materializing portions of computations with the intention
of reuse has also been explored in the domain of feature
selection for machine learning tasks. Our work however
explores the incremental update and reuse of models to build
new models.


8. CONCLUSIONS 

In this paper we presented an approach that utilizes model
materialization and incremental model reuse as first class
citizens while processing data analytics workloads.
Utilizing popular machine learning models we demonstrated
their incremental aspects and detailed an optimization
methodology that determines the best way (in terms of
performance) to build a given new model. We demonstrated
that our approach can achieve significant performance
savings for new model construction while only imposing
modest overheads in storage.

The work opens several avenues for future work. First,
there is a plethora of other models that are important and
can be considered in conjunction with our framework.
Studying their incremental aspects and embedding them into
the same optimization framework is an interesting direction
for future work. Incremental model reuse for analytics is an
important direction of research that blends nicely with the
way current data management systems build integrations to
existing analytical packages. Our framework can be easily
injected between the analytical package and the RDBMS to
recognize as well as handle all opportunities for improved
performance. We are currently building such a system based
on the ideas presented herein, on which we will report soon.

Finally, our focus in this paper has been on the case in
which a total ordering exists in the underlying data set. An
interesting case is when such an ordering does not exist. In
that case the model descriptors will be different, as well
as the associated optimizations. Indeed our entire framework
can be extended for this case as well and we will be
reporting on such extensions in our future work.
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